TABLE 2.1
Evaluating the components of Q-ViT based on the ViT-S backbone.

Method            #Bits  Top-1   #Bits  Top-1   #Bits  Top-1
Full-precision    32-32  79.9    -      -       -      -
Baseline          4-4    79.7    3-3    77.8    2-2    68.2
+IRM              4-4    80.2    3-3    78.2    2-2    69.9
+DGD              4-4    80.4    3-3    78.5    2-2    70.5
+IRM+DGD (Q-ViT)  4-4    80.9    3-3    79.0    2-2    72.0
2.4 Q-DETR: An Efficient Low-Bit Quantized Detection Transformer
Drawing inspiration from the achievements in natural language processing (NLP), object
detection using transformers (DETR) has emerged as a new approach for training an end-to-
end detector using a transformer encoder-decoder [31]. In contrast to earlier methods [201,
153] that heavily rely on convolutional neural networks (CNNs) and necessitate additional
post-processing steps such as non-maximum suppression (NMS) and hand-designed sample
selection, DETR tackles object detection as a direct set prediction problem.
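To make the set-prediction formulation concrete, the sketch below outlines a DETR-style detector in PyTorch: a CNN backbone, a transformer encoder-decoder, and a fixed set of learned object queries decoded directly into class logits and box coordinates, with no NMS step. This is only a minimal illustration under assumed hyper-parameters (it is not the official DETR code), and positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn
import torchvision

class TinyDETR(nn.Module):
    """Minimal DETR-style set predictor (illustrative sketch, not the official model)."""
    def __init__(self, num_classes=91, num_queries=100, d_model=256):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # keep conv stages only
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)       # project C5 features
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6,
                                          batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)           # learned object queries
        self.class_head = nn.Linear(d_model, num_classes + 1)           # +1 for "no object"
        self.bbox_head = nn.Linear(d_model, 4)                          # (cx, cy, w, h)

    def forward(self, images):
        feats = self.input_proj(self.backbone(images))                  # (B, d, H, W)
        src = feats.flatten(2).transpose(1, 2)                          # (B, HW, d) encoder tokens
        tgt = self.query_embed.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(src, tgt)                                 # co-attended queries
        return self.class_head(hs), self.bbox_head(hs).sigmoid()        # direct set prediction

# Each of the num_queries outputs is matched one-to-one to a ground-truth box during
# training (bipartite matching), so no hand-designed NMS post-processing is needed.
```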
Despite this attractiveness, DETR usually has a large number of parameters and floating-point operations (FLOPs). For instance, the DETR model with a ResNet-50 backbone [84] (DETR-R50) contains 39.8M parameters, corresponding to 159 MB of memory usage and 86G FLOPs. This leads to unacceptable memory and computation consumption during inference and challenges deployment on devices with limited resources.
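The memory figure above is simple arithmetic: 39.8M FP32 parameters at 4 bytes each occupy roughly 159 MB, and a low-bit copy of the weights shrinks proportionally. A back-of-the-envelope sketch (weights only, ignoring activations and any layers kept in full precision):

```python
def weight_storage_mb(num_params: float, bits_per_weight: int) -> float:
    """Approximate storage for the weights alone, in megabytes."""
    return num_params * bits_per_weight / 8 / 1e6

params = 39.8e6                                  # DETR-R50 parameter count from the text
print(weight_storage_mb(params, 32))             # ~159 MB in full precision
for bits in (8, 4, 2):
    print(f"{bits}-bit -> {weight_storage_mb(params, bits):.1f} MB")
```

At 4 bits the weights shrink to roughly 20 MB, which is the kind of reduction that motivates the quantization methods discussed next.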
Therefore, substantial efforts on network compression have been made toward efficient online inference [264, 260]. Quantization, which represents a network in low-bit formats, is particularly popular for deployment on AI chips. Yet prior post-training quantization (PTQ) for DETR [161] derives the quantized parameters from pre-trained real-valued models, which often leaves the model in a sub-optimal state due to the lack of fine-tuning on the training data. In particular, the performance drops drastically when quantizing to ultra-low bits (4 bits or less). Alternatively, quantization-aware training (QAT) [158, 259] performs quantization and fine-tuning on the training dataset simultaneously, leading to only negligible performance degradation even at significantly lower bit-widths. Although QAT methods have proven very effective in compressing CNNs [159, 61] for computer vision tasks, low-bit quantization of DETR remains unexplored.
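As background on what such QAT baselines do, LSQ-style methods [61] insert fake quantization into the forward pass and learn the quantization step size jointly with the network weights, back-propagating through the rounding with a straight-through estimator. The following is a simplified sketch of that idea under assumed defaults; it is not the exact quantizer used by Q-DETR later in this section.

```python
import torch
import torch.nn as nn

class LSQQuantizer(nn.Module):
    """Simplified LSQ-style fake quantizer with a learnable step size (illustrative)."""
    def __init__(self, bits=4, signed=True):
        super().__init__()
        self.qmin = -(2 ** (bits - 1)) if signed else 0
        self.qmax = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
        self.step = nn.Parameter(torch.tensor(1.0))    # learnable quantization step size
        self.initialized = False

    def forward(self, x):
        if not self.initialized:                       # one common data-driven initialization
            self.step.data = 2 * x.detach().abs().mean() / (self.qmax ** 0.5)
            self.initialized = True
        # scale the step-size gradient for stable training (forward value is unchanged)
        g = 1.0 / ((x.numel() * self.qmax) ** 0.5)
        step = self.step * g + (self.step - self.step * g).detach()
        q = torch.clamp(x / step, self.qmin, self.qmax)
        q = (q.round() - q).detach() + q               # straight-through estimator for round()
        return q * step                                # dequantized (fake-quantized) output
```

Wrapping the weights and activations of each layer with such a quantizer and then fine-tuning the whole network is, at a high level, how the low-bit DETR baseline below is built.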
In this section, we first build a low-bit DETR baseline, a straightforward solution based on common QAT techniques [61]. Through an empirical study of this baseline, we observe significant performance drops on the VOC [62] dataset. For example, a 4-bit quantized DETR-R50 using LSQ [61] achieves only 76.9% AP50, leaving a 6.4% performance gap compared with the real-valued DETR-R50. We find that the incompatibility of existing QAT methods mainly stems from the unique attention mechanism in DETR, where spatial dependencies are first constructed between the object queries and the encoded features, after which a feed-forward network maps the co-attended object queries to box coordinates and class labels. A naive application of existing QAT methods to DETR leads to query information distortion, and therefore the performance degrades severely. Figure 2.8 shows an example of information distortion in the query features of 4-bit DETR-R50, where we can see a significant distribution variation between the query modules of the quantized DETR and the real-valued version. The query information distortion causes inaccurate focus of the spatial attention, which can be verified by following [169] to visualize the spatial attention weight maps of the 4-bit and real-valued DETR-R50 in Fig. 2.9. We can see that the quantized DETR-R50 bears